This document is for a workshop providing tools, tips, packages, etc. that make data processing and visualization in R easier. It is oriented toward those who have had some exposure to R in an applied fashion, but would also be useful to someone coming to R from another programming language. It is not an introduction to R, nor does this document have anything to do with statistical analysis.
Note that this used to be slide-based, and so it still has a lot of short sentences. I got tired of the limitations of slides for code and other presentation, so it is now in this format.
R has several core data structures:
Vectors form the basis of R data structures.
There are two main types, atomic vectors and lists, but I will treat lists separately.
Here is an R vector. The elements of the vector are numeric values.
x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4
All elements of an atomic vector are the same type. Examples include:
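As a quick base R illustration (with made-up values), note also that mixing types silently coerces everything to the most flexible type present:

```r
# common atomic types
dbl = c(1.5, 2, 3)       # double
int = c(1L, 2L, 3L)      # integer
chr = c('a', 'b')        # character
lgl = c(TRUE, FALSE)     # logical

# mixing types coerces to the most flexible type present
mixed = c(1, 'a', TRUE)
class(mixed)             # character: every element became a string
```

The coercion hierarchy runs logical < integer < double < character.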
An important type of vector is the factor. Factors are used to represent categorical data.
x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q V what the heck?
Levels: q V what the heck?
The underlying representation is numeric, but it is important to remember that factors are categorical.
They can’t be used as numbers would be, as the following demonstrates.
as.numeric(x)
[1] 1 2 3

sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?")) : 'sum' not meaningful for factors
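A related gotcha: to recover numeric values from a factor whose labels are numbers, convert via character first, since as.numeric alone returns the underlying level codes. A small sketch with made-up values:

```r
f = factor(c('10', '20', '20'))
as.numeric(f)                # 1 2 2: the level codes, not the values
as.numeric(as.character(f))  # 10 20 20: the actual values
```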
With multiple dimensions, we are dealing with arrays.
Matrices are 2-d arrays, and are extremely common in scientific computing.
The vectors making up a matrix must all be of the same type.
Creating a matrix can be done in a variety of ways.
# create vectors
x = 1:4
y = 5:8
z = 9:12
rbind(x, y, z) # row bind
cbind(x, y, z) # column bind
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)

Lists in R are highly flexible objects, and probably the most commonly used structure in data science.
They can contain anything as their elements, even other lists.
Here is a list. We use the list function to create it.
x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1
[[2]]
[1] "apple"
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] "cat"
We often want to loop some function over a list.
for (elem in x) print(class(elem))
[1] "numeric"
[1] "character"
[1] "list"
Lists can, and often do, have named elements.
x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]$b
[1] -1
data.frames are a very commonly used data structure. Their elements do not all have to be of the same type, because the data.frame class is actually just a list (of equal-length columns).
But they can also be indexed by row or column as well.
There are other very common object classes, associated with particular packages, that are both a data.frame and some other type of structure.
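A quick check of these points, using a small made-up data.frame:

```r
mydf2 = data.frame(a = 1:3, b = 4:6)
is.list(mydf2)   # TRUE: a data.frame is a list of equal-length columns
mydf2[['b']]     # list-style extraction of a column
mydf2[2, 'b']    # matrix-style row/column indexing also works
```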
mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))

We can also add row names.
rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1
Create an object that is a matrix and/or a data.frame, and inspect its class or structure.
mydf = data.frame(A=1:3, B=letters[1:3])

Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created.
mylist = list(c('a', 'b'), 1:3, mydf)

How is a factor different from a character vector?
How is a data.frame the same as and different from a matrix?
How is a data.frame the same as and different from a list?
Standard methods of reading in data include the following functions:
Base R also comes with the foreign package for reading in other types of files, especially other statistical packages. However, while you may see it still in use, it’s not as useful as what’s found in other packages.
haven: Package to read in foreign statistical files
readxl: for excel files
rio: uses haven, readxl etc. but with just two functions for everything
readr: Faster versions of base R functions
These functions guess column types after an initial scan of the data, though if you don't have 'big' data, the speed gains won't amount to much.
However, they actually can be used as a diagnostic.
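For example, read_csv reports the column types it guessed, and problems() lists values that failed to parse. A sketch with a small made-up file, assuming the readr package is installed:

```r
library(readr)

# a small made-up csv for demonstration
tmp = tempfile(fileext = '.csv')
writeLines(c('id,score', '1,10', '2,not_a_number'), tmp)

# force score to be numeric; the bad value becomes NA with a recorded problem
d = read_csv(tmp, col_types = cols(id = col_integer(), score = col_double()))
problems(d)  # shows the row/column where 'not_a_number' failed to parse
```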
data.table: faster read.table
Typically faster than readr approaches.
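A minimal fread sketch (writing a small temporary file first, just for illustration), assuming the data.table package:

```r
library(data.table)

tmp = tempfile(fileext = '.csv')
fwrite(data.table(x = 1:3, y = c('a', 'b', 'c')), tmp)

dt = fread(tmp)  # separator and column types are auto-detected
class(dt)        # both data.table and data.frame
```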
Note that R can handle many types of data.
Some examples:
And many, many others.
feather: designed to make reading/writing data frames efficient
Works in both Python and R. Still in early stages of development.
letters[4:6][1] "d" "e" "f"
letters[c(13,10,3)][1] "m" "j" "c"
myMatrix[1, 2:3]
mydf['row1', 'b']
mydf[1, 2]
mydf['row1', 2]

If the row/column value is empty, all rows/columns are retained.
mydf['row1',]
mydf[,'b']
mydf[c(1,3),]
mydf[mydf$a >= 2,]

[ : grab a slice of elements/columns
[[ : grab specific elements/columns
$ : grab specific elements/columns
@: extract slot for S4 objects
my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name

Logicals are objects with values of TRUE or FALSE.
Assume x is a vector of numbers.
x = c(-1, 2, 10, -5)
idx = x > 2
idx
[1] FALSE FALSE  TRUE FALSE

x[idx]
[1] 10
We don’t have to create an explicit index object first, as R indexing is ridiculously flexible.
x[x > 2]
x[x != 3]
x[ifelse(x > 2 & x !=10, T, F)]
x[{y = idx; y}]
x[resid(lm(y ~ x)) > 0]

Consider the following unfortunately coded loop:
for (i in 1:nrow(mydf)) {
check = mydf$x[i] > 2
if (check==TRUE) {
mydf$y[i] = 'Yes'
}
else {
mydf$y[i] = 'No'
}
}

Compare:
mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'
# alternative approach that doesn't use indexing
# mydf$y = ifelse(mydf$x > 2, 'Yes', 'No')

This gets us the same thing, and would be much faster.
Boolean indexing is an example of a vectorized operation.
The whole vector is considered.
This is typically much faster.
Example: Log all values in a matrix.
mymatrix_log = log(mymatrix)

Way faster than looping over elements, rows or columns. Here we'll let the apply function stand in for our loop, logging the elements of each column.
mymatrix = matrix(runif(100), 10, 10)
identical(apply(mymatrix, 2, log), log(mymatrix))
[1] TRUE
library(microbenchmark)
microbenchmark(apply(mymatrix, 2, log),
               log(mymatrix))
Unit: microseconds
expr min lq mean median uq max neval
apply(mymatrix, 2, log) 43.227 45.328 49.90904 45.928 48.780 133.883 100
log(mymatrix) 2.701 3.002 3.41332 3.302 3.302 16.210 100
Many vectorized functions already exist in R.
They are often written in C, Fortran etc., and so even faster.
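A few built-in vectorized functions, each replacing an explicit loop:

```r
x = c(2, 4, 6, 8)
cumsum(x)            # running sum: 2 6 12 20

m = matrix(1:6, nrow = 2)
rowSums(m)           # per-row sums, no loop over rows
colMeans(m)          # per-column means
pmax(x, 5)           # elementwise maximum against 5
```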
A family of functions allows for a succinct way of looping.
Common ones include:
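A sketch of the usual suspects (apply, lapply, sapply, tapply, mapply) on toy data:

```r
m = matrix(1:6, nrow = 2)
apply(m, 1, sum)       # apply over matrix margins (1 = rows, 2 = columns)

l = list(a = 1:3, b = 4:6)
lapply(l, sum)         # over list elements; always returns a list
sapply(l, sum)         # same, but simplified to a named vector

tapply(c(1, 2, 3, 4), c('a', 'a', 'b', 'b'), sum)  # by grouping factor
mapply(rep, 1:3, 3:1)  # loop over multiple arguments in parallel
```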
Standardizing variables.
for (i in 1:ncol(mydf)){
x = mydf[,i]
for (j in 1:length(x)){
x[j] = (x[j] - mean(x))/sd(x)
}
}

The above would be a really bad way to use R. The following is much cleaner and now you'd have a function you can use elsewhere.
apply will take a matrix, and apply a function over the margin, row or column, you want.
stdize <- function(x) {
(x-mean(x))/sd(x)
}
apply(mydf, 2, stdize) # 1 for rows, 2 for columnwise application

The previous demonstrates how to use apply.
However, there is a scale function in base R that uses a more vectorized approach under the hood.
The following demonstrates various approaches to standardizing the columns of the matrix, even using a parallelized approach.
The base R function requires very little code and beats the others.
mydf = matrix(rnorm(100000), ncol=1000)
stdize <- function(x) {
(x-mean(x)) / sd(x)
}
doubleloop = function() {
for (i in 1:ncol(mydf)) {
x = mydf[, i]
for (j in 1:length(x)) {
x[j] = (x[j] - mean(x)) / sd(x)
}
}
}
singleloop = function() {
for (i in 1:ncol(mydf)) {
x = mydf[, i]
x = (x - mean(x)) / sd(x)
}
}
library(parallel)
cl = makeCluster(8)
clusterExport(cl, c('stdize', 'mydf'))
doParallel::registerDoParallel(cl)
test = microbenchmark::microbenchmark(doubleloop=doubleloop(),
singleloop=singleloop(),
apply=apply(mydf, 2, stdize),
parApply=parApply(cl, mydf, 2, stdize),
vectorized=scale(mydf), times=25)
stopCluster(cl)
test
Unit: milliseconds
expr min lq mean median uq max neval
doubleloop 2882.148089 2904.739076 2938.21281 2946.57065 2952.25532 3013.29805 25
singleloop 28.989806 29.512434 33.08177 31.17037 32.42606 84.79185 25
apply 32.167901 33.350945 33.96146 34.13204 34.64055 36.18983 25
parApply 17.691611 18.213638 21.91323 19.63893 21.12156 73.60683 25
vectorized 8.085578 8.491433 10.25583 10.66990 11.02532 12.62173 25
Benefits
They are not necessarily faster than explicit loops.
But they are more succinct, and can potentially be faster.
I use R every day, and rarely use explicit loops.
I pretty much never use an explicit double loop, as a little more thinking about the problem will usually provide a more efficient path to solving the problem.
Apply functions and similar approaches should be a part of your regular R experience.
Other versions we’ll talk about have been optimized, but you need to know the basics in order to use those.
And you still may need parallel versions.
With the following matrix, use apply and the sum function to get row or column sums.
x = matrix(1:9, 3, 3)
apply()

With the following list, use lapply and sapply and the sum function to get sums for the elements. There is no margin to specify with a list, so just supply the list and the sum function.
x = list(1:3, 4:6, 7:9)
lapply()
sapply()

sapply is actually just a wrapper for lapply. If you supply the argument simplify=FALSE, the result is identical. Otherwise, it attempts to simplify the result to a vector or matrix.
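A quick demonstration of that relationship (using length rather than sum, so as not to give away the exercise):

```r
y = list(a = 1:2, b = 3:5)
lapply(y, length)   # always a list
sapply(y, length)   # simplified to a named vector

# with simplify = FALSE, sapply gives the same result as lapply
identical(sapply(y, length, simplify = FALSE), lapply(y, length))  # TRUE
```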
Note: More detail on much of this part is given in another workshop.
Pipes are operators that send what comes before to what comes after.
There are many different pipes, and some packages that use their own.
However, the vast majority of packages use the same pipe: %>%
Here, we’ll focus on their use with the dplyr package. Later, we’ll use it for visualizations.
Example:
mydf %>%
select(var1, var2) %>%
filter(var1 == 'Yes') %>%
  summary

Start with a data.frame %>%
   select columns from it %>%
   filter/subset it %>%
   get a summary
We can use variables as soon as they are created.
mydf %>%
mutate(newvar1 = var1 + var2,
newvar2 = newvar1/var3) %>%
  summarise(newvar2avg = mean(newvar2))

Generic example:
basegraph %>%
points %>%
lines %>%
  layout

Most functions are not 'pipe-aware' by default.
Example: pipe to a modeling function.
mydf %>%
  lm(y ~ x) # error

Other pipes could work in this situation.
But generally, one can use a dot.
mydf %>%
  lm(y ~ x, data=.)

Piping is not just for data.frames.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
.. <- . %>%
if (length(.) == 1) .
else paste(.[1], '%>%', ..(.[-1]))
..(.)
} [1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
Put that in your pipe and smoke it, René Magritte!
Pipes are best used interactively.
Extremely useful for data exploration.
Common in many visualization packages.
See the magrittr package for more pipes.
The tidyverse consists of a few key packages:
And of course the tidyverse package which will load all of the above.
See also: lubridate, rvest, stringr and others in the ‘hadleyverse’.
Tidy refers to data arranged in a way that makes analysis and visualization simpler.
In a tidy dataset:
Think long before wide.
Grammar of data manipulation.
Next iteration of plyr.
Focused on tools for working with data frames.
It has three main goals:
Some key operations:
Various join/merge functions:
Little things like:
No need to quote variable names.
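As a sketch of the join functions, here are a few of the dplyr joins (inner_join, left_join, full_join, semi_join, anti_join) on made-up data, assuming dplyr is loaded:

```r
library(dplyr)

band = data.frame(name = c('John', 'Paul', 'Keith'),
                  band = c('Beatles', 'Beatles', 'Stones'))
inst = data.frame(name = c('John', 'Paul', 'Mick'),
                  plays = c('guitar', 'bass', 'vocals'))

inner_join(band, inst, by = 'name')  # only names present in both
left_join(band, inst, by = 'name')   # all of band, matched where possible
anti_join(band, inst, by = 'name')   # rows of band with no match
```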
Let’s say we want to select from our data the following variables:
How might we go about this?
Base R approaches are tedious, or require multiple objects just to get the columns you want.
# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]
# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]
# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.
newData = oldData %>%
filter(Z == 'Yes', Q == 'No') %>%
select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>%
top_n(n=50, var2) %>%
  arrange(desc(var1))

dplyr and piping offer an alternative.
Even though the initial base R approach depicted is fairly concise, it still can potentially be:
Two primary functions for manipulating data
Other useful functions include:
library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4) )
stocks %>% head
        time           X           Y          Z
1 2009-01-01 -1.13209158 0.90099099 1.3601499
2 2009-01-02 0.24670969 0.02995202 -0.4369245
3 2009-01-03 -0.18533572 0.54440515 -1.1737488
4 2009-01-04 0.17047486 -1.00201046 -3.9265851
5 2009-01-05 -0.02331142 -3.10298741 -7.2464021
6 2009-01-06 -1.29181733 -2.22339543 -0.3179180
stocks %>% gather(stock, price, -time) %>% head
        time stock       price
1 2009-01-01 X -1.13209158
2 2009-01-02 X 0.24670969
3 2009-01-03 X -0.18533572
4 2009-01-04 X 0.17047486
5 2009-01-05 X -0.02331142
6 2009-01-06 X -1.29181733
Note that the latter is an example of tidy data while the former is not.
The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.
Extremely useful for data exploration and visualization.
Drawbacks:
multidplyr
data.table works in a notably different way than dplyr. However, you’d use it for the same reasons.
Like dplyr, the data objects are both data.frames and a package-specific class.
Faster subset, grouping, update, ordered joins etc.
In general, data.table works with brackets as in base R.
However, the brackets work like a function call!
x[i, j, by, keyby, with = TRUE, ...]

Importantly: you don't use the brackets as you would with data.frames.
library(data.table)
dt = data.table(x=sample(1:10, 6), g=1:3, y=runif(6))
class(dt)[1] "data.table" "data.frame"
What i and j can be are fairly complex.
In general, you use i for filtering by rows.
dt[2]
dt[2,]
   x g         y
1: 1 2 0.6158981
   x g         y
1: 1 2 0.6158981
In general, you use j to select (by name!) or create new columns.
dt[,x]
dt[,z := x+y] # dt now has a new column
dt[,z]
dt[g>1, mean(z), by=g]
dt
[1]  3  1 10  5  7  6
x g y z
1: 3 1 0.01040662 3.010407
2: 1 2 0.61589805 1.615898
3: 10 3 0.50496462 10.504965
4: 5 1 0.57399499 5.573995
5: 7 2 0.32487892 7.324879
6: 6 3 0.28128785 6.281288
[1] 3.010407 1.615898 10.504965 5.573995 7.324879 6.281288
g V1
1: 2 4.470388
2: 3 8.393126
x g y z
1: 3 1 0.01040662 3.010407
2: 1 2 0.61589805 1.615898
3: 10 3 0.50496462 10.504965
4: 5 1 0.57399499 5.573995
5: 7 2 0.32487892 7.324879
6: 6 3 0.28128785 6.281288
Dropping columns is awkward.
dt[,-y] # creates negative values of y
dt[,-'y', with=F] # drops y, but now needs quotes
## dt[,y:=NULL]  # drops y by reference
## dt$y = NULL   # the base R approach also works
[1] -0.01040662 -0.61589805 -0.50496462 -0.57399499 -0.32487892 -0.28128785
x g z
1: 3 1 3.010407
2: 1 2 1.615898
3: 10 3 10.504965
4: 5 1 5.573995
5: 7 2 7.324879
6: 6 3 6.281288
Group-by, with creation of a new variable.
Note that these actually modify dt in place, a key distinction from dplyr.
dt1 = dt2 = dt
dt[,sum(x,y), by=g] # sum of all x and y values
   g        V1
1: 1 8.584402
2: 2 8.940777
3: 3 16.786252
dt1[,newvar := sum(x), by=g] # add new variable to the original data
    x g          y         z newvar
1: 3 1 0.01040662 3.010407 8
2: 1 2 0.61589805 1.615898 8
3: 10 3 0.50496462 10.504965 16
4: 5 1 0.57399499 5.573995 8
5: 7 2 0.32487892 7.324879 8
6: 6 3 0.28128785 6.281288 16
dt1
    x g          y         z newvar
1: 3 1 0.01040662 3.010407 8
2: 1 2 0.61589805 1.615898 8
3: 10 3 0.50496462 10.504965 16
4: 5 1 0.57399499 5.573995 8
5: 7 2 0.32487892 7.324879 8
6: 6 3 0.28128785 6.281288 16
We can also create groupings on the fly.
For a new summary data set, we’ll take the following approach.
dt2[, list(meanx = mean(x), sumx = sum(x)), by=g==1]
       g meanx sumx
1: TRUE 4 8
2: FALSE 6 24
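data.tables can also be joined by using one inside the brackets of another: with keys set, indexing one table by the other performs a merge. A small sketch on made-up tables (names chosen so as not to clash with dt1 and dt2 above), assuming data.table:

```r
library(data.table)

dtA = data.table(id = c(1, 2, 3), x = c(10, 20, 30))
dtB = data.table(id = c(2, 3, 4), y = c('b', 'c', 'd'))
setkey(dtA, id)
setkey(dtB, id)

dtA[dtB]  # join on the key: all rows of dtB, with matching x from dtA
```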
dt1[dt2]

The following demonstrates some timings of grouped operations from a published benchmark.
By the way, never, ever use aggregate. For anything.
fun elapsed
1: aggregate 114.35
2: by 24.51
3: sapply 11.62
4: tapply 11.33
5: dplyr 10.97
6: lapply 10.65
7: data.table 2.71
Ever.
Really.
Piping with data.table can be done, but it's awkward at best.
mydt[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydt[,newvar:=mean(x),
][,newvar2:=sum(newvar), by=group
][,-'y', with=FALSE
]

Probably better to just use a pipe and dot approach.
mydt[,newvar:=mean(x),] %>%
.[,newvar2:=sum(newvar), by=group] %>%
  .[,-'y', with=FALSE]

Faster methods are great to have.
Drawbacks:
If speed and/or memory is (potentially) a concern, data.table
For interactive exploration, dplyr
Piping allows one to use both, so no need to choose.
And on the horizon…
Coming soon to an R near you: the dtplyr package.
This implements the data table back-end for ‘dplyr’ so that you can seamlessly use data table and ‘dplyr’ together.
Or play with it now. The following shows times for a grouped operation on a data frame of two variables: a random uniform draw of 5e7 values, and a grouping variable of 500k groups.
| package | timing |
|---|---|
| dplyr | 10.97 |
| data.table | 2.71 |
| dtplyr | 2.70 |
Just for giggles I did the same in python with a pandas DataFrame and groupby operation, and it took 7+ seconds.
ggplot2 is an extremely popular package for visualization in R.
It entails a grammar of graphics.
Key ideas:
Strengths:
Aesthetics map data to aesthetic aspects of the plot.
The function used in ggplot to do this is aes
aes(x=myvar, y=myvar2, color=myvar3, group=g)

In general, we start with a base layer and add to it.
In most cases you’ll start as follows.
ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would just produce a plot background.
Layers are added via piping.
The first layers added are typically geoms:
ggplot2 was using pipes before it was cool, and so it has a different pipe: +.
Otherwise, the concept is the same as before.
ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()

And now we would have a scatterplot.
library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()

ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()

In the following, one setting is not mapped to the data.
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25)

There are many statistical functions built in.
Key strength: you don’t have to do much preprocessing.
Quantile regression lines:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
  geom_quantile()

Loess (or additive model) smooth:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
  geom_smooth()

Bootstrapped confidence intervals:
ggplot(mtcars, aes(cyl, mpg)) +
geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets allow for paneled display, a very common operation.
In general, we often want comparison plots.
facet_grid will produce a grid.
facet_wrap is more flexible.
Both use a formula approach to specify the grouping.
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

ggplot2 makes it easy to get good-looking graphs quickly.
However the amount of fine control is extensive.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(color=clarity), alpha=.5) +
scale_y_log10(breaks=c(1000,5000,10000)) +
xlim(0, 10) +
scale_color_brewer(type='div') +
facet_wrap(~cut, ncol=3) +
theme_minimal() +
theme(axis.ticks.x=element_line(color='darkred'),
axis.text.x=element_text(angle=-45),
axis.text.y=element_text(size=20),
strip.text=element_text(color='forestgreen'),
strip.background=element_blank(),
panel.grid.minor=element_line(color='lightblue'),
legend.key=element_rect(linetype=4),
        legend.position='bottom')

In the last example you saw two uses of a theme.
Each argument takes on a specific value or an element function:
The base theme is not too good.
You will almost invariably need to tweak it.
ggplot2 now has its own extension system, and there is even a website to track the extensions.
Examples include:
ggplot2 is an easy to use, but powerful visualization tool.
Allows one to think in many dimensions for any graph:
2d graphs are only useful for conveying the simplest of ideas.
Use ggplot2 to easily create more interesting visualizations.
ggplot2 is the most widely used package for visualization in R.
However, it is not interactive by default.
Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.
General:
Specific functionality:
One of the advantages to piping is that it’s not limited to dplyr style data management functions.
Any R function can be potentially piped to.
This facilitates data exploration, especially visually.
Many newer visualization packages take advantage of piping.
htmlwidgets is a package that makes it easy to create javascript visualizations.
The packages using it typically are pipe-oriented and produce interactive plots.
A couple demonstrations with plotly.
Note the layering as with ggplot2.
Piping used before plotting.
library(plotly)
midwest %>%
filter(inmetro==T) %>%
  plot_ly(x=~percbelowpoverty, y=~percollege, mode='markers')

plotly has modes, which allow for points, lines, text and combinations.
Traces work similar to geoms.
library(mgcv); library(modelr)
mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor)) %>%
add_predictions(gam(mpg~s(wt, am, bs='fs'), data=mtcars)) %>%
arrange(wt) %>%
plot_ly(x=~wt, y=~mpg, color=~amFactor, mode='markers') %>%
  add_trace(x=~wt, y=~pred, color=~amFactor, alpha=.5, hover=~hovertext, name='gam prediction')

The nice thing about plotly is that we can feed a ggplot to it.
It would have been easier to use geom_smooth, so let’s do so.
gp = mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor),
prediction = predict(gam(mpg~s(wt), data=mtcars))) %>%
arrange(wt) %>%
ggplot(aes(x=wt, y=mpg)) +
geom_smooth() +
geom_point(aes(color=amFactor))
ggplotly()

Highcharter is also fairly useful for a wide variety of plots.
library(highcharter); library(quantmod)
x <- getSymbols("GOOG", auto.assign = FALSE)
highchart(type='stock', width=1000) %>%
  hc_add_series(data = x)

visNetwork allows for network visualizations.
library(visNetwork)
visNetwork(nodes, edges) %>% #, height=600, width=800
visNodes(shape='circle',
font=list(),
scaling=list(min=10, max=50, label=list(enable=T))) %>%
  visLegend()

Use the DT package for interactive dataframes.
library(DT)
ggplot2movies::movies %>%
select(1:6) %>%
filter(rating>9) %>%
slice(sample(1:nrow(.), 50)) %>%
  datatable(rownames=F)

Shiny is a framework that essentially allows you to build an interactive website.
Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.
Interactivity allows for even more dimensions to be brought to a graphic.
Interactive graphics are more fun too!
Just a couple visualization packages can go a very long way.
With the right tools, data exploration can be:
Use them to wring your data dry of what it has to offer.
Recommended next steps: